
Conversation

@gflarity (Contributor) commented Jan 19, 2026

What type of PR is this?

/kind feature
/kind bug
/kind api

What this PR does / why we need it:

This PR fixes and improves gang termination behavior in Grove with several key changes:

  1. Gang termination is now opt-in - terminationDelay no longer defaults to 4 hours. When nil, gang termination is disabled entirely for the PodCliqueSet.

  2. Only ready pods count for breach detection - Previously breach calculation considered scheduled pods; now it only counts ready pods, which more accurately reflects actual workload availability.

  3. PCSG-level terminationDelay override - Individual PodCliqueScalingGroups can now override the PCS-level termination delay, allowing finer-grained control over gang termination timing.

  4. Bug fix: nil pointer when base gang doesn't exist - Fixed a crash when the base gang no longer exists after gang termination.

Which issue(s) this PR fixes:

Fixes #277

Special notes for your reviewer:

  • Start with the design doc at docs/designs/gang-termination.md - it provides a high-level overview of the gang termination feature after these changes. Happy to update the code alongside any design doc feedback.
  • Comprehensive E2E tests have been added covering multiple gang termination scenarios (GT1-GT5)
  • Validation ensures PCSG-level terminationDelay can only be set when PCS-level terminationDelay is set

Does this PR introduce an API change?

Gang termination is now opt-in. The `terminationDelay` field no longer defaults to 4 hours - when not set, gang termination is disabled. To enable gang termination, explicitly set `spec.template.terminationDelay` on the PodCliqueSet. Administrators who were expecting gang termination at the PodCliqueSet or PodCliqueScalingGroup levels after 4 hours should update their PodCliqueSets accordingly.

Added `WasAvailable` status field to PodClique to track whether a workload has ever reached its MinAvailable threshold. Gang termination only triggers for workloads that were previously available, preventing termination during initial creation and startup.

Added `terminationDelay` field to PodCliqueScalingGroupConfig allowing per-PCSG override of the PCS-level termination delay.
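
For illustration, here is a minimal Go sketch of the override rules described above; the helper name and signature are hypothetical and are not taken from this PR:

```go
package sketch

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// effectiveTerminationDelay sketches the resolution order described above: a nil
// PodCliqueSet-level delay disables gang termination entirely, and a PCSG-level
// delay, when set, overrides the PodCliqueSet-level value for that scaling group.
// The boolean result reports whether gang termination is enabled at all.
func effectiveTerminationDelay(pcsDelay, pcsgDelay *metav1.Duration) (time.Duration, bool) {
	if pcsDelay == nil {
		return 0, false // gang termination disabled for the whole PodCliqueSet
	}
	if pcsgDelay != nil {
		return pcsgDelay.Duration, true // per-PCSG override wins
	}
	return pcsDelay.Duration, true
}
```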

Additional documentation e.g., enhancement proposals, usage docs, etc.:

docs/designs/gang-termination.md - Comprehensive design document explaining gang termination behavior, configuration, and debugging

@gflarity added the kind/bug label Jan 19, 2026
@gflarity added the kind/api-change and kind/enhancement labels Jan 19, 2026
@shayasoolin left a comment

Great job!

allErrs = append(allErrs, field.Forbidden(fldPath.Index(i).Child("terminationDelay"), "terminationDelay can only be set on PodCliqueScalingGroupConfig when PodCliqueSetTemplateSpec.terminationDelay is set (gang termination is enabled)"))
} else if scalingGroupConfig.TerminationDelay.Duration <= 0 {
allErrs = append(allErrs, field.Invalid(fldPath.Index(i).Child("terminationDelay"), scalingGroupConfig.TerminationDelay, "terminationDelay must be greater than 0"))
}

and another else-if case, to validate that the PCSG-specific termination delay is not greater than the PCS one.

@gflarity (Contributor Author):

See https://docs.google.com/document/d/11RvFMS1l5RH_FY54G6wd1Q0wSJ2RCobiPRAs2vAhzgM/edit?disco=AAAByqL0jIQ. I discussed with @nvrohanv and @athreesh, and while I don't see why someone would do this, we're not going to babysit this.

Collaborator:

I agree with @shayasoolin. The PCSG termination delay should not be more than the PCS one. We need to provide an API where results are deterministic.

@gflarity (Contributor Author):

Let's chat with @nvrohanv and @athreesh. I'm ambivalent tbh, but we did a final pass on the requirements and this was the call.

@nvrohanv (Contributor) commented Jan 24, 2026

Yes, the reason I don't think we should babysit this is that we assume if you choose to use gang termination, you know what you are doing. In that scenario I can understand someone being more tolerant: if one out of 3 PCSGs is down, they may give it a longer amount of time to recover than if the PCS minAvailable is breached, where they have less tolerance because the system is not functional. Since there's a valid use case for setting it higher and we assume this is a power-user feature, I think we should provide full flexibility.

@athreesh:

LGTM!

  • How will existing users be notified that terminationDelay default changed from 4h to disabled? Should release notes include migration guidance?

  • Do we need user-facing documentation in docs/user-guide/? I guess I see it in operator/api-reference so should be fine

athreesh previously approved these changes Jan 21, 2026

@gflarity (Contributor Author):

  • How will existing users be notified that terminationDelay default changed from 4h to disabled? Should release notes include migration guidance?

Good question, I've just updated the release notes section in the PR description with some guidance. I believe this will be included in the release notes when we cut a release? CC: @sanjaychatterjee.

@gflarity (Contributor Author):

  • Do we need user-facing documentation in docs/user-guide/? I guess I see it in operator/api-reference so should be fine

Maybe? I'll take a stab at adding something. I believe @nvrohanv is working on a user guide PR, I might just wait for that to merge then follow the style/approach there.

@sanjaychatterjee (Collaborator):

I will review it this week.

// TerminationDelay overrides the PodCliqueSet-level terminationDelay for this scaling group.
// Can only be set if PodCliqueSetTemplateSpec.TerminationDelay is set (gang termination is enabled).
// When set, this value is used instead of the PodCliqueSet-level terminationDelay for gang termination
// decisions affecting this scaling group's replicas.
Collaborator:

Do we need validation that this value should be less than or equal to the PCS one?

// for each PCSG (PCSG override if set, otherwise PCS default).
// It returns the names of all such PodCliqueScalingGroups and minimum of all the waitDurations.
func getMinAvailableBreachedPCSGInfo(pcsgs []grovecorev1alpha1.PodCliqueScalingGroup, terminationDelay time.Duration, since time.Time) ([]string, time.Duration) {
func getMinAvailableBreachedPCSGInfoWithEffectiveDelay(pcsgs []grovecorev1alpha1.PodCliqueScalingGroup, pcs *grovecorev1alpha1.PodCliqueSet, since time.Time) ([]string, time.Duration) {
Collaborator:

Can we add a unit test?


// findMatchingPCSGConfig finds the PodCliqueScalingGroupConfig that matches the given PCSG.
// It returns nil if no matching config is found.
func findMatchingPCSGConfig(pcs *grovecorev1alpha1.PodCliqueSet, pcsg *grovecorev1alpha1.PodCliqueScalingGroup) *grovecorev1alpha1.PodCliqueScalingGroupConfig {
Collaborator:

Can we add a unit test?

// This field is used for gang termination: a PodClique can only be considered in breach of
// MinAvailable if it was previously available (WasAvailable=true).
// +kubebuilder:default=false
WasAvailable bool `json:"wasAvailable"`
Collaborator:

Do we need this in the API? Wouldn't an update event tell us if the old version of the object was available but the new version is not?

@gflarity (Contributor Author):

I mentioned this in the "GREP":

Motivation for WasAvailable:

  1. Ignore Initial Startup: New PodCliques should not be terminated simply because they haven't yet reached their MinAvailable threshold. The WasAvailable gate ensures gang termination only triggers for workloads that were previously healthy and then became degraded—not for workloads that are still starting up.

  2. Operator Crash Resilience: The WasAvailable flag is persisted in the PodClique's status (not held in memory). This ensures that if the Grove operator crashes and restarts during the transition to availability, the flag is not lost. Because it's a sticky bit that never reverts to false, the operator can safely restart at any time without missing the transition or incorrectly treating a previously-available PodClique as never-available.

I'll go into a bit more detail in the doc, but we do need to persist this information for a few reasons:
• If the operator crashes and restarts, we can't reliably infer whether a PodClique was ever available just from the conditions. The UpdateInProgress reason doesn't tell us either way, so we'd be guessing.
• There's also a race condition where pods could briefly reach MinAvailable then die before we reconcile. Without the sticky bit, we'd lose that info and wrongly treat it as "never available."

I also considered just adding another condition, but decided against it in favor of the Kubernetes pattern/convention of using status fields for internal bookkeeping (like observedGeneration) rather than conditions. Conditions are better suited to user-facing state that users would alert on. So ultimately I went with the separate status boolean.
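
As a rough sketch of the two pieces described above (the sticky bit and the gate it enables), using a pared-down stand-in for the real status type and hypothetical helper names:

```go
package main

import "fmt"

// PodCliqueStatus is a pared-down stand-in for the real status type, carrying
// only the fields relevant to this sketch.
type PodCliqueStatus struct {
	ReadyReplicas int32
	WasAvailable  bool
}

// updateWasAvailable sets the sticky bit once the clique first reaches MinAvailable.
// It never reverts to false, so an operator restart cannot lose the transition.
func updateWasAvailable(status *PodCliqueStatus, minAvailable int32) {
	if status.ReadyReplicas >= minAvailable {
		status.WasAvailable = true
	}
}

// breachedForGangTermination reports a breach only for cliques that were previously
// available, so workloads that are still starting up are never gang terminated.
func breachedForGangTermination(status PodCliqueStatus, minAvailable int32) bool {
	return status.WasAvailable && status.ReadyReplicas < minAvailable
}

func main() {
	s := PodCliqueStatus{ReadyReplicas: 0}
	updateWasAvailable(&s, 3)
	fmt.Println(breachedForGangTermination(s, 3)) // false: never available yet

	s.ReadyReplicas = 3
	updateWasAvailable(&s, 3) // clique becomes available; sticky bit set

	s.ReadyReplicas = 2 // a pod dies
	fmt.Println(breachedForGangTermination(s, 3)) // true: was available, now degraded
}
```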

@gflarity (Contributor Author) commented Jan 27, 2026

Update - After going through the edge case below, I do see that if we respect the previous MinAvailableBreached=True on operator restart (and perhaps all the time) and don't require a "FLIP", it would solve this edge case. I'll revive that implementation and see whether additional edge cases pop up. That said, WasAvailable definitely has advantages in terms of simplicity and being explicitly observable.


For more background, here's the initial approach I tried and abandoned:

The PodClique controller watches "owned" Pods using a predicate. This triggers PodClique controller reconciliation when pods are deleted or have status changes (specifically when the Ready condition transitions). During this reconciliation, the "old" (current but to-be-reconciled) PodClique object is retrieved from the Kubernetes API. We then "Reconcile" and determine what the new values should be for status fields like PodClique.Status.ReadyReplicas. At this point we essentially have an "old" and "new" PodClique object, allowing us to notice changes in PodClique.Status.ReadyReplicas. Using this approach, we could avoid setting MinAvailableBreached to true unless oldPodClique.Status.ReadyReplicas >= minAvailable in the "old" object. However, this is insufficient given our requirements. Consider the following edge case:

Common Starting Point

| Time | Event | MinAvailableBreached Status | Reason | WasAvailable |
|------|-------|-----------------------------|--------|--------------|
| T0 | PCLQ created, 0 pods | False | InsufficientScheduledPods | false |
| T1 | 3 pods scheduled, starting | False | InsufficientScheduledPods | false |

At T1, both scenarios are identical: 3 pods scheduled and starting.

Scenario A: Workload Becomes Available Then Degrades

| Time | Event | MinAvailableBreached Status | Reason | WasAvailable |
|------|-------|-----------------------------|--------|--------------|
| T2 | All 3 pods become ready | False | SufficientReadyPods | true |
| T3 | 1 pod dies (2/3 ready) | True | InsufficientReadyPods | true |
| T4 | Operator crashes | - | - | - |
| T5 | Operator restarts | True | InsufficientReadyPods | ? |

Should gang terminate? YES - workload was healthy, now degraded.

Scenario B: Workload Never Becomes Available

| Time | Event | MinAvailableBreached Status | Reason | WasAvailable |
|------|-------|-----------------------------|--------|--------------|
| T2 | 1 pod crashes during init | True | InsufficientReadyPods | false |
| T3 | Operator crashes | - | - | - |
| T4 | Operator restarts | True | InsufficientReadyPods | ? |

Should gang terminate? NO - workload never achieved availability.

State After Restart Comparison

| Field | Scenario A (T5) | Scenario B (T4) |
|-------|-----------------|-----------------|
| Status | True | True |
| Reason | InsufficientReadyPods | InsufficientReadyPods |
| ScheduledReplicas | 3 | 3 |
| ReadyReplicas | 2 | 2 |
| Should gang terminate? | YES | NO |

The condition and status fields are identical. Without WasAvailable, the operator cannot distinguish between these scenarios.

@sanjaychatterjee (Collaborator):

Can we create a GREP for this work? It is important to highlight the reasoning behind this work. You can use issue #277 as the GREP number. For the new GREP template look at PR #362.

@sanjaychatterjee (Collaborator):

  • How will existing users be notified that terminationDelay default changed from 4h to disabled? Should release notes include migration guidance?

Good question, I've just updated the release notes section in the PR description with some guidance. I believe this will be included in the release notes when we cut a release? CC: @sanjaychatterjee.

Thanks for the release notes. Additionally, we need a GREP and usage docs as well.

@gflarity (Contributor Author) commented Jan 24, 2026

Can we create a GREP for this work? It is important to highlight the reasoning behind this work. You can use issue #277 as the GREP number. For the new GREP template look at PR #362.

Oh I included gang-termination.md as a GREP and used the MNNVL doc as a template. I can take a look at the template PR and follow that layout, sure.

@nvrohanv (Contributor):

  • Do we need user-facing documentation in docs/user-guide/? I guess I see it in operator/api-reference so should be fine

Maybe? I'll take a stab at adding something. I believe @nvrohanv is working on a user guide PR, I might just wait for that to merge then follow the style/approach there.

We do need the documentation, but I can take the action item of adding that once this is merged in. In general I think the flow should be:

  1. The feature gets merged in and API documentation is updated as part of its PR
  2. docs/user-guide gets updated with a proper guide in a separate PR

nvrohanv previously approved these changes Jan 24, 2026
@unmarshall added the do-not-merge/hold label Jan 25, 2026
@unmarshall (Collaborator) left a comment

1/n reviews

@@ -0,0 +1,375 @@
# Gang Termination Design Doc
Collaborator:

The designs directory is going to be removed. Please rework this document to use the GREP template instead and put it under proposals. Before writing a GREP, please also read docs/proposals/README.md.

Collaborator:

Maybe you can just move this document out of this PR. Let's create a single document for gang scheduling and termination. This doc as it stands today needs a bit of work; otherwise it will hold up your fix for no good reason.


# Overview

Gang termination is a mechanism in Grove that ensures when a component becomes unhealthy (falls below minimum availability threshold), the entire group (gang) is terminated rather than leaving it in a degraded state. This is particularly important for distributed AI inference workloads where partial availability often means the workload cannot function properly.
Collaborator:

Can you define a component?
Ideally one should introduce what a Gang is, then talk about gang termination.

Gang termination is a mechanism in Grove

This needs to change as well, as it is not a new concept introduced in Grove but a commonly known concept in gang-scheduled workloads.


Gang termination is a mechanism in Grove that ensures when a component becomes unhealthy (falls below minimum availability threshold), the entire group (gang) is terminated rather than leaving it in a degraded state. This is particularly important for distributed AI inference workloads where partial availability often means the workload cannot function properly.

## Abbreviations
Collaborator:

If we are going to have this section, then we will have to repeat it for every proposal that is written, which I think is quite unnecessary. Instead, create a separate document for commonly used abbreviations that all documents can refer to.


| Abbreviation | Full Name | Description |
|--------------|-----------|-------------|
| PCS | PodCliqueSet | Grove CRD that manages a set of PodCliques and PodCliqueScalingGroups |
Collaborator:

PCS does not manage PCLQ and PCSG; it is the controllers within Grove that manage these resources.

| Abbreviation | Full Name | Description |
|--------------|-----------|-------------|
| PCS | PodCliqueSet | Grove CRD that manages a set of PodCliques and PodCliqueScalingGroups |
| PCLQ | PodClique | Grove CRD representing a group of related pods |
Collaborator:

PCLQ is a group of pods that share the same PodSpecTemplate and serve a single purpose.

- **Configurable Grace Period:** Allow time for Kubernetes scheduler to recover before terminating
- **Startup Protection:** Avoid terminating workloads that haven't yet reached full availability
- **Rolling Update Safety:** Prevent false terminations during expected pod churn
- **Multi-Level Granularity:** Support gang termination at PCLQ, PCSG, and PCS levels
Collaborator:

What is gang termination at PCS level?

Contributor:

Gang termination at the PCS level means the whole PCS replica gets kicked back to scheduling.


**Limitations:**

- **Ready Pods Only (PCLQ level):** Only ready pods count toward availability—starting pods are not considered
Collaborator:

how is this a limitation?

Contributor:

Agreed, this seems like a feature. Also, it's only once-ready pods, right?


- **Ready Pods Only (PCLQ level):** Only ready pods count toward availability—starting pods are not considered
- **Sticky WasAvailable:** Once a PodClique has reached MinAvailable, the WasAvailable flag never reverts, even if the workload is later degraded and recovers
- **No Partial Recovery:** When gang termination triggers, the affected replica (PCS or PCSG) is deleted in its entirety, healthy PodCliques within that replica are not preserved.
Collaborator:

how is this a limitation?

Contributor:

Also agreed; I believe this would only happen if minAvailable is breached, at which point we want the whole thing triggered.

Individual `PodCliqueScalingGroupConfig` entries can override the PCS-level `terminationDelay`:

- If PCS-level `terminationDelay` is nil, gang termination is disabled for the entire PCS
- If PCS-level `terminationDelay` is set, each PCSG can optionally override it with its own delay
Collaborator:

Why did we choose this behavior? If we have TerminationDelay at both the PCS and PCSG levels, then it should be respected wherever it is defined, as long as termination delay is enabled. The issue is that a nil value of the PCS termination delay has been taken as the indication of whether this feature is enabled or disabled. This IMHO is not so nice and also not very intuitive.

From an API design perspective, when we define a delay at multiple levels (thus allowing overriding), then if the PCS does not define TerminationDelay but it is defined at the PCSG level, it should still be honored as long as this feature is enabled.

Contributor:

Hmm, it seems a bit odd to do it that way as well: since minAvailable defaults to 1 (if I remember correctly), it seems like you should be getting gang semantics throughout. If anything I would lean towards validating that you can't set it on just a PCSG. Something we should discuss more.

Collaborator:

For a PCLQ, if minAvailable is not set then it defaults to the PCLQ template replicas. For a PCSG it defaults to 1. Gang semantics (scheduling and termination) apply to identified pod-gangs. Currently there are only 2 types of PodGangs:

Base PodGang
Consists of:

  • minAvailable replicas of stand-alone PCLQs
  • minAvailable replicas of the PCSG. Within a PCSG, if a constituent PCLQ defines minAvailable then only that many pods.
    This means that only these many pods are required for a functional application. If the number goes below that and stays like that for terminationDelay seconds then it's time to terminate the gang.

Scaled PodGang
Similar behavior applies to a scaled PodGang, where only minAvailable of the constituent PCLQs are considered to determine if the minAvailableBreached condition is true.

Now, what happens when you do not define TerminationDelay? I believe we then need to have a default (very large) termination delay.

Consider the following PCS composition:

  • Standalone PCLQ - router
  • PCSG - comprises the decode & prefill PCLQs

Case #1
You define a terminationDelay of 1hr on the PCS and 2hr on the PCSG.
Now what does it mean for the base PodGang?
If the router has minAvailableBreached set to true and it has already exceeded 1hr then it will gang terminate the base PodGang. A higher termination delay on the PCSG would not come into play here.

What is the behavior for the scaled PodGang?
This is relatively simple: the PCS termination delay will not come into effect. For all scaled PodGangs, only the termination delay defined at the PCSG will be considered.

Case #2
You define a terminationDelay of 2hr on the PCS and 1hr on the PCSG.
Now what does it mean for the base PodGang?
For the base PodGang only the PCS termination delay will apply. So if the PCSG PCLQs have had their minAvailableBreached condition set for more than 1hr but less than 2hr, there will not be any gang termination. From the API perspective this behavior is utterly confusing.

So as soon as we introduce a two-level termination delay, we should consider the behavior carefully.
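
A minimal Go sketch of the selection rule as this comment describes it (not of the merged behavior); the function and parameter names are hypothetical, and the fallback for a scaled PodGang without a PCSG-level delay is an assumption:

```go
package sketch

import "time"

// podGangTerminationDelay sketches the per-PodGang rule described above: the base
// PodGang only ever uses the PCS-level delay, while a scaled PodGang uses the owning
// PCSG's delay when one is defined. Falling back to the PCS-level delay when the PCSG
// does not define one is an assumption; the comment above only covers the case where
// both delays are set.
func podGangTerminationDelay(isBasePodGang bool, pcsDelay time.Duration, pcsgDelay *time.Duration) time.Duration {
	if isBasePodGang || pcsgDelay == nil {
		return pcsDelay
	}
	return *pcsgDelay
}
```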

@unmarshall (Collaborator) left a comment

2/n review comments

Co-authored-by: Madhav Bhargava <madhav.bhargava@sap.com>
Signed-off-by: Geoff Flarity <geoff.flarity@gmail.com>